[AMDGPU] Optimize LDS DMA soft waitcnt #138802

kerbowa · 2025-05-07T05:09:38Z

This patch adds support for optimizing S_WAITCNT_VMCNT_LDS_DMA_soft
pseudo instructions by analyzing whether they can be removed based on
the absence of LDS DMA operations.

These optimizations are a precursor to a dependent patch where these
waitcnt pseudos will actually be emitted by the memory legalizer. Adding
the waitcnt in the memory model first without any optimization would be
too painful of a performance penalty.

kerbowa · 2025-05-07T05:09:54Z

[AMDGPU] Handle direct loads to LDS in memory model #142018
[AMDGPU] Optimize LDS DMA soft waitcnt #138802 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

llvmbot · 2025-05-07T05:11:09Z

@llvm/pr-subscribers-backend-amdgpu

Author: Austin Kerbow (kerbowa)

Changes

This patch adds support for optimizing S_WAITCNT_VMCNT_LDS_DMA_soft
pseudo instructions by analyzing whether they can be removed based on
the absence of LDS DMA operations.

These optimizations are a precursor to a dependent patch where these
waitcnt pseudos will actually be emitted by the memory legalizer. Adding
the waitcnt in the memory model first without any optimization would be
too painful of a performance penalty.

Full diff: https://github.com/llvm/llvm-project/pull/138802.diff

4 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+12)
(modified) llvm/lib/Target/AMDGPU/SIInstrInfo.h (+1)
(modified) llvm/lib/Target/AMDGPU/SOPInstructions.td (+1)
(modified) llvm/test/CodeGen/AMDGPU/lds-dma-waitcnt.mir (+170)

diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 6f5083acd738d..3e0cf5f35318b 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -1278,6 +1278,18 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
     if (Opcode == AMDGPU::S_WAITCNT) {
       unsigned IEnc = II.getOperand(0).getImm();
       AMDGPU::Waitcnt OldWait = AMDGPU::decodeWaitcnt(IV, IEnc);
+
+      // These pseudo waitcnt instructions are only needed to synchronize DS operations
+      // with direct LDS loads that use vmcnt. We can safely relax them when no
+      // outstanding direct LDS loads exist, even if other vmcnt events are pending.
+      if (II.getOpcode() == AMDGPU::S_WAITCNT_VMCNT_LDS_DMA_soft) {
+        unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
+        AMDGPU::Waitcnt LDSDMAWait;
+        ScoreBrackets.determineWait(LOAD_CNT, RegNo, LDSDMAWait);
+        if (LDSDMAWait.LoadCnt == ~0u)
+          OldWait.LoadCnt = ~0u;
+      }
+
       if (TrySimplify)
         ScoreBrackets.simplifyWaitcnt(OldWait);
       Wait = Wait.combined(OldWait);
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.h b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
index 4b97f58ce92b9..8f32c5223266f 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.h
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.h
@@ -1010,6 +1010,7 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
   static unsigned getNonSoftWaitcntOpcode(unsigned Opcode) {
     switch (Opcode) {
     case AMDGPU::S_WAITCNT_soft:
+    case AMDGPU:: S_WAITCNT_VMCNT_LDS_DMA_soft:
       return AMDGPU::S_WAITCNT;
     case AMDGPU::S_WAITCNT_VSCNT_soft:
       return AMDGPU::S_WAITCNT_VSCNT;
diff --git a/llvm/lib/Target/AMDGPU/SOPInstructions.td b/llvm/lib/Target/AMDGPU/SOPInstructions.td
index 3d3f1ba3f5170..a6a080a3574db 100644
--- a/llvm/lib/Target/AMDGPU/SOPInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SOPInstructions.td
@@ -1608,6 +1608,7 @@ let OtherPredicates = [HasImageInsts] in {
   def S_WAIT_DSCNT_soft : SOPP_Pseudo <"s_soft_wait_dscnt", (ins s16imm:$simm16), "$simm16">;
   def S_WAIT_KMCNT_soft : SOPP_Pseudo <"s_soft_wait_kmcnt", (ins s16imm:$simm16), "$simm16">;
 }
+def S_WAITCNT_VMCNT_LDS_DMA_soft : SOPP_Pseudo <"s_soft_waitcnt" , (ins SWaitCnt:$simm16), "$simm16">;
 
 def S_SETHALT : SOPP_Pseudo <"s_sethalt" , (ins i32imm:$simm16), "$simm16",
     [(int_amdgcn_s_sethalt timm:$simm16)]>;
diff --git a/llvm/test/CodeGen/AMDGPU/lds-dma-waitcnt.mir b/llvm/test/CodeGen/AMDGPU/lds-dma-waitcnt.mir
index 21372c06d3223..71e7a9f29689d 100644
--- a/llvm/test/CodeGen/AMDGPU/lds-dma-waitcnt.mir
+++ b/llvm/test/CodeGen/AMDGPU/lds-dma-waitcnt.mir
@@ -117,3 +117,173 @@ body:             |
     S_ENDPGM 0
 
 ...
+
+# Soft waitcnt should be honored here.
+# GCN-LABEL: name: buffer_load_dword_lds_ds_read_soft_wait
+# GCN:      BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: S_WAITCNT 3952
+#                     vmcnt(0)
+# GCN-NEXT: S_BARRIER
+---
+name: buffer_load_dword_lds_ds_read_soft_wait
+body:             |
+  bb.0:
+    $m0 = S_MOV_B32 0
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3952
+    S_BARRIER
+    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
+    S_ENDPGM 0
+
+...
+
+# No need for waitcnt.
+# GCN-LABEL: name: buffer_store_lds_dword_ds_read_soft_wait
+# GCN:      BUFFER_STORE_LDS_DWORD
+# GCN-NEXT: S_BARRIER
+---
+name: buffer_store_lds_dword_ds_read_soft_wait
+body:             |
+  bb.0:
+    $m0 = S_MOV_B32 0
+    BUFFER_STORE_LDS_DWORD $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(3) poison` + 4), (store (s32) into `ptr addrspace(1) poison` + 4)
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3952
+    S_BARRIER
+    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
+    S_ENDPGM 0
+
+...
+
+# Soft waitcnt should mean vmcnt(1) before the barrier and vmcnt(0) after.
+# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait
+# GCN:      BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: S_WAITCNT 3953
+#                     vmcnt(1)
+# GCN-NEXT: S_BARRIER
+# GCN-NEXT: S_WAITCNT 3952
+#                     vmcnt(0)
+# GCN-NEXT: DS_READ_B32_gfx9
+---
+name: series_of_buffer_load_dword_lds_ds_read_soft_wait
+body:             |
+  bb.0:
+    $m0 = S_MOV_B32 0
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    S_BARRIER
+    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
+    S_ENDPGM 0
+
+...
+
+# No waitcnt before the barrier because counter is too high
+# GCN-LABEL: name: buffer_load_dword_lds_ds_read_soft_wait_redundant
+# GCN:      BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: S_BARRIER
+# GCN-NEXT: S_WAITCNT 3952
+#                     vmcnt(0)
+# GCN-NEXT: DS_READ_B32_gfx9
+---
+name: buffer_load_dword_lds_ds_read_soft_wait_redundant
+body:             |
+  bb.0:
+    $m0 = S_MOV_B32 0
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    S_BARRIER
+    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
+    S_ENDPGM 0
+
+...
+
+# Combine waitcnt.
+# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait_repeat
+# GCN:      BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: S_WAITCNT 3953
+#                     vmcnt(1)
+# GCN-NEXT: S_BARRIER
+# GCN-NEXT: S_WAITCNT 3952
+#                     vmcnt(0)
+# GCN-NEXT: DS_READ_B32_gfx9
+---
+name: series_of_buffer_load_dword_lds_ds_read_soft_wait_repeat
+body:             |
+  bb.0:
+    $m0 = S_MOV_B32 0
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    S_BARRIER
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
+    S_ENDPGM 0
+
+...
+
+# Merge waitcnt.
+# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait_merge
+# GCN:      BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: S_WAITCNT 3953
+#                     vmcnt(1)
+# GCN-NEXT: S_BARRIER
+# GCN-NEXT: S_WAITCNT 3952
+#                     vmcnt(0)
+# GCN-NEXT: DS_READ_B32_gfx9
+---
+name: series_of_buffer_load_dword_lds_ds_read_soft_wait_merge
+body:             |
+  bb.0:
+    $m0 = S_MOV_B32 0
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3954
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    S_BARRIER
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3952
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3952
+    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
+    S_ENDPGM 0
+
+...
+
+
+# Handle the preexisting waitcnt.
+# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait_preexisting
+# GCN:      BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: S_WAITCNT 0
+# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
+# GCN-NEXT: S_BARRIER
+# GCN-NEXT: S_WAITCNT 3952
+#                     vmcnt(0)
+# GCN-NEXT: DS_READ_B32_gfx9
+---
+name: series_of_buffer_load_dword_lds_ds_read_soft_wait_preexisting
+body:             |
+  bb.0:
+    $m0 = S_MOV_B32 0
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
+    S_WAITCNT 0
+    BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    S_BARRIER
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
+    S_ENDPGM 0
+
+...

github-actions · 2025-05-07T05:11:57Z

✅ With the latest revision this PR passed the C/C++ code formatter.

arsenm · 2025-05-07T05:36:44Z

llvm/test/CodeGen/AMDGPU/lds-dma-waitcnt.mir

+    S_BARRIER
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    S_WAITCNT_VMCNT_LDS_DMA_soft 3953
+    $vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)


Suggested change

$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)

$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32), addrspace 3)

Should fix all the MMOs to have address space directly and drop the filler poison values

arsenm · 2025-05-07T05:37:24Z

llvm/lib/Target/AMDGPU/SOPInstructions.td

  def S_WAIT_DSCNT_soft : SOPP_Pseudo <"s_soft_wait_dscnt", (ins s16imm:$simm16), "$simm16">;
  def S_WAIT_KMCNT_soft : SOPP_Pseudo <"s_soft_wait_kmcnt", (ins s16imm:$simm16), "$simm16">;
 }
+def S_WAITCNT_VMCNT_LDS_DMA_soft : SOPP_Pseudo <"s_soft_waitcnt" , (ins SWaitCnt:$simm16), "$simm16">;


Document this

Yes please! What is it? How does it interact with S_WAITCNT_VMCNT? Why does SIMemoryLegalizer want to start emitting it?

Can we avoid baking "DMA" into the name? I don't think that terminology has been used since GFX9. Later documents just talk about "load to LDS".

I would like to see the follow-up patch in which SIMemoryLegalizer emits those before this lands, can you create a stacked PR with the MemoryLegalizer changes @kerbowa ?
I don't understand why we need these now, and why it's needed for correctness

This patch adds support for optimizing `S_WAITCNT_VMCNT_LDS_DMA_soft` pseudo instructions by analyzing whether they can be removed based on the absence of LDS DMA operations. These optimizations are a precursor to a dependent patch where these waitcnt pseudos will actually be emitted by the memory legalizer. Adding the waitcnt in the memory model first without any optimization would be too painful of a performance penalty.

ssahasra · 2025-06-18T03:35:44Z

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp

+      // operations with direct LDS loads that use vmcnt. We can safely relax
+      // them when no outstanding direct LDS loads exist, even if other vmcnt
+      // events are pending.
+      if (II.getOpcode() == AMDGPU::S_WAITCNT_DIRECT_LDS_LOAD_soft &&


Why does the opcode need to identify specifically direct loads to LDS? We can do it with just S_WAITCNT, right? As far as I can see, the memory model is this: release fences must wait for any direct loads to LDS if those fences carry the LDS address space. Then the opcode should be "AMDGPU::S_WAITCNT_LDS". It's more interesting that actually we only need to do this on release fences and not on pure acquire fences.

In the waitcnt pass we are not really looking at fences and atomics directly. The idea is to rely on the "MemoryLegalizer" pass to insert soft waitcnt which can be optimized. If they are just regular waitcnt how can I differentiate from other "soft" vmcnt(0) that are added for reasons besides direct load to LDS?

Yes, that's a problem with the current separation of concerns between the memory legalizer and the waitcnt inserter. When the inserter sees a vmcnt(0), it is unsafe to relax it without complete knowledge of why it was inserted. There are two ways we could fix this:

Make the legalizer safe by default, which means it always inserts a zero count wherever a wait of any kind is required. Then make sure that the inserter can track every relevant operation including buffer invalidates. It can then safely relax the zero count based on its scores.

Make the legalizer aggressive by default, which means it always inserts a ~0 count wherever a wait of any kind is required. Then make sure the inserter does the opposite, which is to put a non-trivial count based on scores.

In either case, we only ever need one S_WAITCNT_soft opcode for everything, because the inserter has complete knowledge of all operations that depend on a wait.

I am very much in favour of option 1. But for now, we could maybe modify the legalizer so that it puts a vmcnt(~0) at any workgroup-scoped S_WAITCNT_soft. Then the inserter can change it to vmcnt(0) if it sees direct loads preceding such a soft waitcnt.

Yes, that's a problem with the current separation of concerns between the memory legalizer and the waitcnt inserter.

I don't understand why you call this a "problem". Maybe it doesn't work in exactly the way you would like, but I think it does work perfectly well.

The legalizer is safe by default, which means it always inserts a zero count wherever a wait of any kind is required. The inserter can safely relax the zero count based on its scores, but this is implemented without the inserter having to know anything about fences/invalidates/etc specifically; instead, everything the inserter needs to know is in the waitcnt instruction itself.

For most waitcnts, no extra information is needed; it's just a regular waitcnt. For the new case in this patch, the extra information is "this is a wait on vmcnt but I'm only interested in VMEM load-to-lds instructions".

Right. It's good to know that the legalizer is meant to be safe by default. The problem is that the waitcnt inserter does not have complete knowledge when it relaxes the waitcnt. (This is besides the fact that the inserter is also supposed to restore legality between a load and its use, or a store and its operands, etc). In particular, the legalizer will insert vmcnt(0) after an invalidate, but there is an explicit comment in the inserter which says that it should not be aware of invalidates. That's what makes things difficult. The inserter can safely relax a vmcnt(0) if it also correctly tracks the invalidates. That's the part that is a problem right now.

Pierre-vh

This hasn't fully clicked for me yet, so bear with me a bit.

Do I understand correctly that the end goal here to change the rule of synchronization to also include global->lds direct loads, so that they cannot reorder with normal LDS load?

Why do we suddenly need to do that ? Is this a tailored change for a specific case?
I'd like to see the reasoning in memory model terms as to why global->lds loads should be considered as normal loads and fall under the usual synchronize-with rules (and thus a new wait is needed).

Pierre-vh · 2025-06-18T08:27:25Z

llvm/lib/Target/AMDGPU/SOPInstructions.td

+// and their relation to these direct LDS loads. For example, if global loads
+// to LDS are mixed with global loads not writing to LDS, a wait may only be
+// necessary for the LDS-writing loads to synchronize with other LDS operations.
+def S_WAITCNT_DIRECT_LDS_LOAD_soft : SOPP_Pseudo <"s_soft_waitcnt" , (ins SWaitCnt:$simm16), "$simm16">;


This instruction is named S_WAITCNT_DIRECT_LDS_LOAD_soft but we don't have a S_WAITCNT_DIRECT_LDS_LOAD. I don't think it's right to call this a "soft" waitcnt. It should have another name, IMO.

ssahasra · 2025-06-19T08:31:28Z

Why do we suddenly need to do that ? Is this a tailored change for a specific case? I'd like to see the reasoning in memory model terms as to why global->lds loads should be considered as normal loads and fall under the usual synchronize-with rules (and thus a new wait is needed).

A global->lds load includes a store to LDS, which may be accessed by a subsequent DS_LOAD from a different wave. At a release operation (could be a barrier or fence or atomic store release), we need to ensure that all prior stores are finished prior to the release, including these direct stores to LDS. We already have a vmcnt(0) at global and device scopes, but not at workgroup scope. We need to insert a vmcnt at workgroup scope, but as an optimization, do it only when the release includes LDS, and there are pending LDS stores prior to the release. This set of patches is trying to model such a vmcnt using a new opcode.

EDIT: In case you are wondering, global->lds loads update vmcnt (or LOAD_CNT in SIInsertWaitcnts), which is why we need all this logic. Otherwise we could have just simulataneously modelled them as LDS stores and global loads.

ssahasra

Adding a new opcode just capture direct load to LDS is not necessary for this to work. In the memory model, a direct load to LDS is simply a global load followed by a dependent LDS store. There is ample opportunity in the memory legalizer and the wait count inserter to optimally compute the vmcnt at workgroup scope without needing more opcodes.

kerbowa requested review from Pierre-vh, arsenm, jayfoad, rampitec and ssahasra May 7, 2025 05:10

kerbowa marked this pull request as ready for review May 7, 2025 05:10

llvmbot added the backend:AMDGPU label May 7, 2025

kerbowa force-pushed the users/kerbowa/direct-lds-load-memory-model-waits branch from 62ed5c2 to 1baa896 Compare May 7, 2025 05:12

arsenm reviewed May 7, 2025

View reviewed changes

kerbowa force-pushed the users/kerbowa/direct-lds-load-memory-model-waits branch from 1baa896 to 551d49c Compare May 29, 2025 19:28

kerbowa mentioned this pull request May 29, 2025

[AMDGPU] Handle direct loads to LDS in memory model #142018

Closed

ssahasra reviewed Jun 18, 2025

View reviewed changes

Pierre-vh reviewed Jun 18, 2025

View reviewed changes

ssahasra requested changes Jul 2, 2025

View reviewed changes

This was referenced Jul 25, 2025

[AMDGPU] introduce S_WAITCNT_FENCE_soft emitted by memory legalizer #150167

Closed

[AMDGPU] introduce S_WAITCNT_LDS_DIRECT in the memory legalizer #150887

Merged

kerbowa closed this Jul 31, 2025

kerbowa deleted the users/kerbowa/direct-lds-load-memory-model-waits branch July 31, 2025 15:12

	$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
	$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32), addrspace 3)

[AMDGPU] Optimize LDS DMA soft waitcnt #138802

[AMDGPU] Optimize LDS DMA soft waitcnt #138802

Uh oh!

Conversation

kerbowa commented May 7, 2025

Uh oh!

kerbowa commented May 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented May 7, 2025

Uh oh!

github-actions bot commented May 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Pierre-vh left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ssahasra commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ssahasra left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

kerbowa commented May 7, 2025 •

edited

Loading

github-actions bot commented May 7, 2025 •

edited

Loading

ssahasra commented Jun 19, 2025 •

edited

Loading